{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# TensorFlow Training and Serving in SageMaker \"Script Mode\"\n",
    "\n",
    "Script mode is a training script format for TensorFlow that lets you execute any TensorFlow training script in SageMaker with minimal modification. The [SageMaker Python SDK](https://github.com/aws/sagemaker-python-sdk) handles transferring your script to a SageMaker training instance. On the training instance, SageMaker's native TensorFlow support sets up training-related environment variables and executes your training script. In this tutorial, we use the SageMaker Python SDK to launch a training job and deploy the trained model.\n",
    "\n",
    "Script mode supports training with a Python script, a Python module, or a shell script. In this example, we use a Python script to train a classification model on the [MNIST dataset](http://yann.lecun.com/exdb/mnist/). In this example, we will show how easily you can train a SageMaker using TensorFlow 1.x and TensorFlow 2.x scripts with SageMaker Python SDK. In addition, this notebook demonstrates how to perform real time inference with the [SageMaker TensorFlow Serving container](https://github.com/aws/sagemaker-tensorflow-serving-container). The TensorFlow Serving container is the default inference method for script mode. For full documentation on the TensorFlow Serving container, please visit [here](https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/tensorflow/deploying_tensorflow_serving.rst).\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Set up the environment\n",
    "\n",
    "Let's start by setting up the environment:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Install TensorFlow and SageMaker\n",
    "_Note:  Ignore Warnings and Errors Below_"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "!pip uninstall -y tensorflow tensorflow-estimator tb-nightly tf-estimator-nightly # tensorboard"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip3 uninstall -y tensorflow tensorflow-estimator tb-nightly tf-estimator-nightly # tensorboard"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip3 install tensorflow==2.0.0 --upgrade --ignore-installed --no-cache --user # tensorboard==1.14.0"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": false
   },
   "outputs": [],
   "source": [
    "!pip3 install sagemaker --upgrade --ignore-installed --no-cache --user"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip3 install requests==2.20.1 --user"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Restart the Kernel to Recognize New Dependencies Above"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from IPython.display import display_html\n",
    "display_html(\"<script>Jupyter.notebook.kernel.restart()</script>\", raw=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": false
   },
   "outputs": [],
   "source": [
    "!pip3 list"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Create the SageMaker Session"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import sagemaker\n",
    "from sagemaker import get_execution_role\n",
    "\n",
    "sagemaker_session = sagemaker.Session()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Setup the Service Execution Role and Region\n",
    "Get IAM role arn used to give training and hosting access to your data.  See the documentation for how to create these.  Note, if more than one role is required for notebook instances, training, and/or hosting, please replace the `sagemaker.get_execution_role()` with a the appropriate full IAM role arn string(s)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "role = get_execution_role()\n",
    "print('RoleARN:  {}\\n'.format(role))\n",
    "\n",
    "region = sagemaker_session.boto_session.region_name\n",
    "print('Region:  {}'.format(region))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Training Data\n",
    "\n",
    "The MNIST dataset has been loaded to the public S3 buckets ``sagemaker-sample-data-<REGION>`` under the prefix ``tensorflow/mnist``."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "original_training_data_uri = 's3://sagemaker-sample-data-{}/tensorflow/mnist'.format(region)\n",
    "print(original_training_data_uri)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Copy the Training Data to Your Notebook Disk"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "local_data_path = './data'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "!aws --region {region} s3 cp --recursive {original_training_data_uri} {local_data_path}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There are four ``.npy`` file under this prefix:\n",
    "* ``train_data.npy``\n",
    "* ``eval_data.npy``\n",
    "* ``train_labels.npy``\n",
    "* ``eval_labels.npy``"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "!ls {local_data_path}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Upload the Data to S3 for Distributed Training Across Many Workers\n",
    "We are going to use the `sagemaker.Session.upload_data` function to upload our datasets to an S3 location. The return value inputs identifies the location -- we will use later when we start the training job.\n",
    "\n",
    "This is S3 bucket and prefix that you want to use for training and model data.  This should be within the same region as the Notebook Instance, training, and hosting."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "bucket = sagemaker_session.default_bucket()\n",
    "data_prefix = 'sagemaker/tensorflow-mnist/data'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "training_data_uri = sagemaker_session.upload_data(path=local_data_path, bucket=bucket, key_prefix=data_prefix)\n",
    "print(training_data_uri)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!aws s3 ls --recursive {training_data_uri}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Train\n",
    "https://sagemaker.readthedocs.io/en/stable/using_tf.html#distributed-training\n",
    "\n",
    "This tutorial's training script was adapted from TensorFlow's official [CNN MNIST example](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/examples/tutorials/layers/cnn_mnist.py). We have modified it to handle the ``model_dir`` parameter passed in by SageMaker. This is an S3 path which can be used for data sharing during distributed training and checkpointing and/or model persistence. We have also added an argument-parsing function to handle processing training-related variables.\n",
    "\n",
    "At the end of the training job we have added a step to export the trained model to the path stored in the environment variable ``SM_MODEL_DIR``, which always points to ``/opt/ml/model``. This is critical because SageMaker uploads all the model artifacts in this folder to S3 at end of training."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Training Script"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!ls ./src/mnist_keras_tf2.py"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can add custom Python modules to the `src/requirements.txt` file.  They will automatically be installed - and made available to your training script."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!cat ./src/requirements.txt"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Train with SageMaker `TensorFlow` Estimator\n",
    "\n",
    "https://sagemaker.readthedocs.io/en/stable/using_tf.html#distributed-training\n",
    "\n",
    "The `sagemaker.tensorflow.TensorFlow` estimator handles locating the script mode container, uploading your script to a S3 location and creating a SageMaker training job. Let's call out a couple important parameters here:\n",
    "\n",
    "* `py_version` is set to `'py3'`\n",
    "* `script_mode` is set to `True`\n",
    "* `distributions` is used to configure the distributed training setup.  It's required only if you are doing distributed training either across a cluster of instances or across multiple GPUs.  Here we are using parameter servers as the distributed training schema. SageMaker training jobs run on homogeneous clusters.  To make parameter server more performant in the SageMaker setup, we run a parameter server on every instance in the cluster, so there is no need to specify the number of parameter servers to launch.  You can find the full documentation on how to configure `distributions` [here](https://github.com/aws/sagemaker-python-sdk/tree/master/src/sagemaker/tensorflow#distributed-training).\n",
    "\n",
    "Notes:  \n",
    "* This example uses two(2) `ml.p3.2xlarge` instances.  You will likely need to request a SageMaker instance limit increase from Support before continuing.\n",
    "\n",
    "* Alternatively, you can specify `ml.c5.2xlarge` or choose only one(1) `ml.p3.2xlarge`.\n",
    "\n",
    "* To recognize the `requirements.txt`, we must include `src/setup.py` per [this](https://github.com/aws/sagemaker-python-sdk/issues/911) GitHub issue."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sagemaker.tensorflow import TensorFlow\n",
    "\n",
    "model_output_path = 's3://{}/sagemaker/tensorflow-mnist/training-runs'.format(bucket)\n",
    "\n",
    "mnist_estimator = TensorFlow(entry_point='mnist_keras_tf2.py',\n",
    "                             source_dir='./src',\n",
    "                             output_path=model_output_path,\n",
    "                             role=role,\n",
    "                             train_instance_count=1,\n",
    "                             train_instance_type='ml.c5.2xlarge',\n",
    "#                             train_instance_type='local',\n",
    "                             framework_version='2.0.0',\n",
    "                             py_version='py3',\n",
    "                             enable_sagemaker_metrics=True,\n",
    "                             script_mode=True,\n",
    "                             distributions={'parameter_server': {'enabled': True}},\n",
    "                             tags = [{'Key' : 'Project', 'Value' : 'tensorflow-mnist'},\n",
    "                                     {'Key' : 'TensorBoard', 'Value' : 'dist'}],\n",
    "                             # Assuming the loglines from the TensorFlow training job are as follows:\n",
    "                             #    Test loss    : 0.0635053280624561\n",
    "                             #    Test accuracy: 0.9821\n",
    "                             metric_definitions=[\n",
    "                                 {'Name': 'test:loss', 'Regex': 'Test loss    : ([0-9\\\\.]+)'},\n",
    "                                 {'Name': 'test:accuracy', 'Regex': 'Test accuracy: ([0-9\\\\.]+)'},\n",
    "                             ])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### `fit` the Model (Approx. 15 mins)\n",
    "\n",
    "To start a training job, we call `estimator.fit(training_data_uri)`.\n",
    "\n",
    "An S3 location is used here as the input. `fit` creates a default channel named `'training'`, which points to this S3 location. In the training script we can then access the training data from the location stored in `SM_CHANNEL_TRAINING`. `fit` accepts other parameters, as well. See the API doc [here](https://sagemaker.readthedocs.io/en/stable/estimators.html#sagemaker.estimator.EstimatorBase.fit) for details."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "mnist_estimator.fit(inputs=training_data_uri, wait=False)\n",
    "\n",
    "training_job_name = mnist_estimator.latest_training_job.name\n",
    "print('training_job_name:  {}'.format(training_job_name))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "After some time, or in a separate Python notebook, we can attach to the running job using the `training_job_name`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "from sagemaker.tensorflow import TensorFlow\n",
    "\n",
    "mnist_estimator = TensorFlow.attach(training_job_name=training_job_name)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!aws --region {region} s3 ls --recursive {model_output_path}/{training_job_name}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(model_output_path)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Option 1:  Predict Directly in the Notebook"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Use TensorFlow Core to load the model from `model_output_path`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!aws --region {region} s3 ls --recursive {model_output_path}/{training_job_name}/output/"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!aws --region {region} s3 cp {model_output_path}/{training_job_name}/output/model.tar.gz ./model/model.tar.gz"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "saved_model_path = './model/000000001'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!rm -rf {saved_model_path}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!ls ./model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "!tar -xzvf ./model/model.tar.gz -C ./model"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Load the model and list the prediction signatures"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import tensorflow as tf\n",
    "\n",
    "loaded_model = tf.saved_model.load(export_dir='./model/000000001/')\n",
    "\n",
    "loaded_model.signatures"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Show the prediction signature details"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "from tensorflow.python.tools import saved_model_cli\n",
    "\n",
    "parser = saved_model_cli.create_parser()\n",
    "args = parser.parse_args(['show', '--dir', saved_model_path, '--tag_set', 'serve', '--signature_def', 'serving_default'])\n",
    "saved_model_cli.show(args)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Perform inference with the signature"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "train_data = np.load('{}/train_data.npy'.format(local_data_path))\n",
    "train_labels = np.load('{}/train_labels.npy'.format(local_data_path))\n",
    "\n",
    "eval_data = np.load('{}/eval_data.npy'.format(local_data_path))\n",
    "eval_labels = np.load('{}/eval_labels.npy'.format(local_data_path))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from matplotlib import pyplot as plt\n",
    "\n",
    "image_idx = 4444 # random idx number\n",
    "\n",
    "plt.imshow(eval_data[image_idx].reshape(28, 28), cmap='Greys')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "predict_fn = loaded_model.signatures[\"serving_default\"]\n",
    "\n",
    "result = predict_fn(input_1=tf.constant(eval_data[image_idx].reshape(1, 28, 28, 1)))\n",
    "prediction = result['output_1'].numpy().argmax()\n",
    "\n",
    "label = eval_labels[image_idx]\n",
    "print('Prediction is {}, Label is {}, Matched: {}'.format(prediction, label, prediction == label))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Option 2:  Create a SageMaker Endpoint and Perform REST-based Predictions"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Deploy the Trained Model to a SageMaker Endpoint (Approx. 10 mins)\n",
    "\n",
    "After training, we use the `TensorFlow` estimator object to build and deploy a `TensorFlowPredictor`.  This creates a Sagemaker Endpoint -- a hosted prediction service that we can use to perform inference.\n",
    "\n",
    "As mentioned above we have implementation of `model_fn` in the `tensorflow_mnist.py` script that is required. We are going to use default implementations of `input_fn`, `predict_fn`, `output_fn` and `transform_fm` defined in [sagemaker-tensorflow-containers](https://github.com/aws/sagemaker-tensorflow-containers).\n",
    "\n",
    "The `deploy()` method creates a SageMaker model, which is then deployed to an endpoint to serve prediction requests in real time.  We will use the TensorFlow Serving container for the endpoint, because we trained with script mode.  This serving container runs an implementation of a web server that is compatible with SageMaker hosting protocol.  The [Using your own inference code](https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-inference-main.html) document explains how SageMaker runs inference containers.\n",
    "\n",
    "The arguments to the deploy function allow us to set the number and type of instances that will be used for the Endpoint.  These do not need to be the same as the values we used for the training job.  For example, you can train a model on a set of GPU-based instances, and then deploy the Endpoint to a fleet of CPU-based instances, but you need to make sure that you return or save your model as a cpu model similar to what we did in `mnist.py`.  Here we will deploy the model to a single `ml.p3.2xlarge` instance.  Alternatively, you can use a `ml.c5.2xlarge` instance."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "predictor = mnist_estimator.deploy(initial_instance_count=1, instance_type='ml.c5.2xlarge')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Invoke the Endpoint\n",
    "\n",
    "Let's download the training data and use that as input for inference."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "train_data = np.load('{}/train_data.npy'.format(local_data_path))\n",
    "train_labels = np.load('{}/train_labels.npy'.format(local_data_path))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The formats of the input and the output data correspond directly to the request and response formats of the `Predict` method in the [TensorFlow Serving REST API](https://www.tensorflow.org/serving/api_rest). SageMaker's TensforFlow Serving endpoints can also accept additional input formats that are not part of the TensorFlow REST API, including the simplified JSON format, line-delimited JSON objects (\"jsons\" or \"jsonlines\"), and CSV data.\n",
    "\n",
    "In this example we are using a `numpy` array as input, which will be serialized into the simplified JSON format. In addition, TensorFlow Serving can also process multiple items at once as you can see in the following code. You can find the complete documentation on how to make predictions against a TensorFlow Serving SageMaker endpoint [here](https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/tensorflow/deploying_tensorflow_serving.rst#making-predictions-against-a-sagemaker-endpoint)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Examine the prediction result from the TensorFlow 2.0 model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "predictions = predictor.predict(train_data[:50])\n",
    "for i in range(0, 50):\n",
    "    prediction = predictions['predictions'][i]\n",
    "    label = train_labels[i]\n",
    "    print('Prediction is {}, Label is {}, Matched: {}'.format(prediction, label, prediction == label))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### (Optional) Cleanup Endpoint\n",
    "\n",
    "Let's delete the endpoint we just created to prevent incurring any extra costs."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sagemaker.Session().delete_endpoint(predictor.endpoint)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "conda_python3",
   "language": "python",
   "name": "conda_python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.5"
  },
  "notice": "Copyright 2017 Amazon.com, Inc. or its affiliates. All Rights Reserved.  Licensed under the Apache License, Version 2.0 (the \"License\"). You may not use this file except in compliance with the License. A copy of the License is located at http://aws.amazon.com/apache2.0/ or in the \"license\" file accompanying this file. This file is distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License."
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
